Eurasian Journal of Educational Research, Issue 58, 2015, 41-60 


A Study on Detecting of Differential Item Functioning of
PISA 2006 Science Literacy Items in Turkish and American
Samples

Nükhet ÇIKRIKÇI DEMİRTAŞLI *
Seher ULUTAŞ **


Suggested Citation: 

Çıkrıkçı Demirtaşlı, N. & Ulutaş, S. (2015). A Study on Detecting of Differential Item
Functioning of PISA 2006 Science Literacy Items in Turkish and American 
Samples. Eurasian Journal of Educational Research, 58, 41-60. 

http://dx.doi.org/10.14689/ejer.2015.58.3 


Abstract 

Problem Statement: Item bias occurs when individuals from different 
groups (different gender, cultural background, etc.) have different 
probabilities of responding correctly to a test item despite having the same 
skill levels. It is important that tests or items do not have bias in order to 
ensure the accuracy of decisions taken according to test scores. Thus, items 
should be tested for bias during the process of test development and 
adaptation. Items used in testing programs such as the Program for
International Student Assessment (PISA), whose results inform
educational policies throughout the participating countries, should be
reviewed for bias. The study examines whether items of the 2006 PISA
science literacy test, applied in Turkey, show bias.

Purpose of the Study: The aim of this study is to analyze the measurement
equivalence of the PISA 2006 science literacy test in Turkish and American
groups in terms of structural invariance and to determine whether the
science literacy items show inter-cultural bias.

Methods: The study included data for 757 Turkish and 856 American
15-year-old students. Exploratory factor analysis (EFA) and confirmatory
factor analysis (CFA) were performed to determine whether the PISA
science literacy test was equivalent in measurement construct in both
groups; multi-group confirmatory factor analysis (MCFA) was used to
identify differences in the factor structure according to cultures. Item bias
was detected via the Mantel-Haenszel (MH), Simultaneous Item Bias Test
(SIBTEST) and Item Response Theory Likelihood-Ratio Analysis (IRT-LR)
procedures.

Findings and Results: According to the MCFA results, the PISA 2006 science
literacy test showed an equivalent measurement construct for the Turkish and
American groups. Moreover, the three methods of analysis agreed at B
and C levels for 15 items in the Turkish sample and 25 items in the
American sample in terms of DIF. According to expert opinions, common
sources of item bias were familiarity with item content and differing skill
levels between cultures.

Conclusions and Recommendations: The 38 items that showed DIF by each of
the three methods were accepted as having DIF. According to the findings of the
present study, the possible sources of bias in the items will not change the
average level of student performance in participating countries. However,
reviewing item content before test administration would be beneficial in
order to reduce the errors caused by items with DIF across
different language and cultural groups in international comparative
studies.

Keywords: PISA, DIF, Mantel-Haenszel, SIBTEST, IRT-LR 

Bias is the presence of some characteristic of an item that results in differential
performance for individuals who have the same ability on the measured trait but come
from different ethnic, sex, cultural, or religious groups. In other words, an item is
biased if equally able (or proficient) individuals from different groups do not have
equal probabilities of answering the item correctly. This situation results from
features of items or from circumstances that are irrelevant to the purposes of the
test. Bias is a systematic error affecting the validity of test scores (Angoff, 1993;
Hambleton and Rodgers, 1995; Ellis & Raju, 2003; Reynolds, Livingston & Wilson,
2006).

Items should be tested for potential bias during test construction and adaptation
in order to ensure the accuracy of decisions that will be based on the test scores.
Methods of determining item bias focus on the validity of test items across
different sub-groups (Shepard, Camilli & Williams, 1985). Different
methods are used in determining item bias according to classical test theory (CTT)
and item response theory (IRT). Within CTT, many researchers investigated bias
by comparing groups via classical statistics such as arithmetic means or item-test
correlations. The item bias results obtained by classical methods can vary according to
groups, and therefore cannot be generalized to other groups. Thus, researchers have
adopted the latent trait (implicit features) model (Embretson & Reise, 2000; Hambleton,
Clauser, Mazor & Jones, 1993). In the literature on psychometrics, it was suggested that
one term other than bias be used for the statistical observation, quite apart from its
judgmental or interpretive meaning and use, and another term for the
judgment and evaluation of bias in the social sense. Finally the expression differential
item functioning (DIF) came into use, referring to the simple observation that an item
displays different statistical properties in different group settings (after controlling
for differences in the abilities of the groups) (Angoff, 1993, p. 4).

In DIF analysis, the performance of two groups whose skill/competence levels
are matched is compared for each item. The group of primary interest is considered
the focus group and the other is the reference group, which is the basis of the
comparison (Donoghue, Holland & Thayer, 1993). Conducting DIF analysis by IRT
involves comparing the parameter values estimated from these two groups and the
areas between the item characteristic curves estimated from the two groups. In IRT,
the item characteristic curve gives a graphical representation of the mathematical
function relating the probability of a correct response to the skill measured by the
items in the test. When the item characteristic curves of an item are not the same for
the reference and focus groups, the item does not measure that proficiency (or ability)
similarly in both groups, and hence shows DIF. The item can be interpreted as biased
because the item characteristic curves diverge further as the difference between item
parameter values increases (Osterlind, 1983; Camilli & Shepard, 1994; Zumbo, 1999;
Embretson & Reise, 2000; Baker, 2001).
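
The comparison of item characteristic curves described above can be illustrated with a minimal sketch. It assumes a two-parameter logistic model and uses hypothetical item parameters (a, b) for the reference and focus groups; the function names and the numerical integration range are illustrative choices, not part of the original analysis.

```python
import math

def icc_2pl(theta, a, b):
    """Probability of a correct response under a two-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def unsigned_area(ref, foc, lo=-4.0, hi=4.0, steps=800):
    """Approximate the unsigned area between two ICCs with the midpoint rule."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        theta = lo + (i + 0.5) * width
        total += abs(icc_2pl(theta, *ref) - icc_2pl(theta, *foc)) * width
    return total

# Hypothetical (a, b) parameters: same discrimination, different difficulty.
reference_item = (1.2, 0.0)
focus_item = (1.2, 0.5)

area_same = unsigned_area(reference_item, reference_item)   # identical curves
area_shifted = unsigned_area(reference_item, focus_item)    # curves differ
print(area_same, area_shifted)
```

When the two curves coincide the area is zero; as the difficulty parameters move apart the area grows, which is the sense in which a larger parameter difference signals DIF.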

Bias determination methods based on CTT have advantages and disadvantages
relative to IRT (Camilli & Shepard, 1994; Thissen, 2001). Studies generally apply
several methods in combination, because previous studies have shown that different
methods yield differing outcomes (Acar, 2008; Ateşok Deveci, 2008; Benito & Ara,
2000; Bertrand & Boiteau, 2003; Doğan & Öğretmen, 2006; Skaggs & Lissitz, 1992;
Welkenhuysen-Gybels & Billiet, 2002; Yıldırım, 2006; Bakan Kalaycıoğlu, 2008;
Yıldırım & Berberoğlu, 2009). In the present study, the potential for bias within the
PISA 2006 science literacy test was investigated by three different methods.

PISA results are taken into consideration by educational policy makers around
the world. PISA determines the proficiency of 15-year-old students in
mathematics, science and reading skills at international level. PISA focuses on the
competency to use knowledge and skills to overcome difficulties faced in daily life.
PISA studies have been conducted at three-year intervals since 2000, and one of the
mathematics literacy, science literacy and reading skills areas is designated as the
dominant area in each application period (MEB, 2007; OECD, 2005; MEB, 2010).

Previous studies have reported that the items used in international evaluation
studies such as PISA can be subject to bias resulting from translation, adaptation,
differences in education programs, etc. (Ercikan, 2002; Ercikan, McCreith &
Lapointe, 2005; Yıldırım & Berberoğlu, 2009; Le, 2009). The original PISA test was
developed in English and translated into the languages of participating countries.
Thus, language is the most important cultural factor leading to test bias. This
study investigates whether the items in the PISA 2006 science literacy test conducted
in Turkey are suspected of bias. The purpose of the research is to determine, using
statistical and judgmental approaches, the intercultural (USA and Turkey) equivalence
of the measurement structure of the PISA 2006 science literacy test, the items
suspected of bias, and the possible reasons for bias.

Method

The following methods were employed in this research.

Population and Sampling 

Approximately 400,000 students, randomly sampled to represent about 20 million
15-year-old students from 57 countries, participated in the
PISA 2006 study. A two-stage stratified sample design was used for the PISA
assessment. The first-stage sampling units consisted of schools having 15-year-old
students. These schools were selected randomly from seven regions in Turkey. Once
schools were selected to be in the sample, a complete list of each sampled school's
15-year-old students was prepared. The second-stage sampling units were 15-year-old
students within sampled schools. As a result, the Turkish data were obtained from 4,942
15-year-old students in 160 schools (OECD, 2005; MEB, 2008).

In this study, data were used from 856 American and 657
Turkish students who completed booklet 1 and booklet 5 of the PISA 2006 science
literacy test.

These booklets were chosen for this study because most of their items were
released for research by the PISA consortium. The data were retrieved from the
official PISA web site.

Measures 

The PISA 2006 science literacy test, which was developed by the OECD, was used as the
measurement instrument. The PISA test and questionnaires measure higher-order thinking
skills such as scientific process skills and attitudes towards science. In the PISA test,
approximately 40% of items are open-ended, 8% are short-answer and 52% are
multiple-choice. Booklet 1 and booklet 5 include 58 and 60 science
literacy items, respectively. Of these items, 23 were released; 15 were multiple-choice
questions, and 8 were open-ended items (MEB, 2007; MEB, 2010).

Data Analysis 

Multiple-choice items were scored as 0-1 and open-ended items were scored as
0-1-2. So that parameter estimation models for dichotomous items could be used,
partially correct and fully correct answers to items scored as 1 and 2 were
accepted as correct answers and scored as 1. Wrong, blank,
inaccessible or invalidly marked answers, for example those where more than one
option was marked, were coded with 0 as an incorrect response.
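
The recoding rule above can be sketched as follows. This is only an illustration of the stated rule; the function name and the way missing or invalid answers are represented (None, a text label) are assumptions, not the coding scheme of the PISA data files.

```python
def dichotomize(raw_score):
    """Recode a raw item score to 0/1: partial (1) and full (2) credit count as correct."""
    if raw_score in (1, 2):
        return 1
    # Wrong, blank, inaccessible or invalid answers are all treated as incorrect.
    return 0

# Hypothetical raw responses for one open-ended item.
responses = [0, 1, 2, None, "invalid", 2]
recoded = [dichotomize(r) for r in responses]
print(recoded)
```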

Exploratory factor analysis (EFA) was used to determine the dimensionality and
factor structure of the PISA science literacy test in the American and Turkish samples.
EFA is generally used to evaluate the factor structure or dimensionality of scales and
tests (Gierl, 2000; Bolt & Ysseldyke, 2006; Çet, 2006; Yıldırım, 2006). For this purpose,
both principal axis factoring (PAF) and principal component factor (PCF) analysis
methods were applied to the data in order to find statistical evidence for the
dimensionality of the PISA science literacy tests in each group. PAF yielded a higher
total explained variance for the first factor than PCF, and more items (41 items)
loaded on the first factor under PAF. These findings were considered evidence of
unidimensionality in this study. According to the results, the PISA science literacy test
has one dominant dimension, with an eigenvalue of 16.660 for the first factor, and
there was a large difference between the first and second factors (eigenvalue 1.869).

As a second preliminary analysis, confirmatory factor analysis (CFA) was used to verify
the unidimensionality of the PISA tests and to determine whether the factor structure
differs between groups. CFA is used in international studies of factor structures
between groups and of unidimensionality (Gierl, 2000). Covariance matrices for CFA
were created from the SPSS data via the PRELIS program. The existence of a
unidimensional structure was checked for each group and booklet (test) using the
covariance matrices in the LISREL program (Jöreskog & Sörbom, 1993; Şimşek, 2007).
Many studies (Ercikan & Kim, 2005; Çet, 2006; Yıldırım, 2008) used multi-group
confirmatory factor analysis (MCFA) to determine the equivalence of factor structures
of tests developed for different cultures. MCFA was used to determine whether the
factor structures of the PISA science literacy test differed between the Turkish and
American samples.

In this study, one of the DIF analyses was IRT-based. Before the DIF
analysis, the PISA data were tested against the basic IRT assumptions:
unidimensionality, local independence and model-data fit. Under the IRT
assumptions, the data should have a one-dimensional structure (Hambleton &
Swaminathan, 1985; Gierl, 2000). For this reason, the PAF results presented in the
previous paragraph were taken as evidence for the unidimensionality assumption of
IRT for the PISA science literacy test. According to the PAF results, the eigenvalues
of the first and second factors were 16.660 and 1.869, respectively,
and there were only small differences among the eigenvalues of the second,
third and remaining factors (Hambleton & Swaminathan, 1985; Gierl, 2000). Since the PISA
data met the unidimensionality assumption, the local independence assumption
was also accepted for the PISA 2006 science literacy test data (Hambleton, Swaminathan
& Rogers, 1991; Osterlind, 1983). In addition to these analyses, model-data fit was
tested using one-, two- and three-parameter IRT models in the BILOG-MG program.
The two-parameter model showed the best fit to the data, having the largest number
of items with a chi-square probability value > 0.05.

Mantel-Haenszel (MH) Method. In DIF analysis with the MH method, the
performance of the two groups on an item is compared after matching them on total
score (Benito & Ara, 2000; Dorans & Holland, 1992; Donoghue, Holland & Thayer, 1993).
The MH D-DIF value, which shows the extent to which an item displays DIF, is classified
according to three categories: A, minimal level; B, middle level; and C, high level. An
item is in category A if its MH D-DIF value does not differ significantly from zero or
is less than 1 in absolute value. An item is in category C if its absolute MH D-DIF
value is greater than 1.5 and significantly greater than 1.0. Items whose MH D-DIF
values fall between these limits are in category B (Dorans &
Holland, 1992). During MH analysis, the total scores of the American and Turkish
groups were calculated and categorized according to 20% percentile bands. These
categories were then used in the EZDIF program developed by Waller (2005).
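
As a rough illustration of the MH D-DIF statistic, the sketch below computes the Mantel-Haenszel common odds ratio across score bands and converts it to the delta scale with the usual factor of -2.35. The counts are hypothetical, the function names are illustrative, and the classification uses only the size of the statistic; the significance tests that are also part of the A/B/C rule are omitted, so this is not a reproduction of the EZDIF output.

```python
import math

def mh_d_dif(strata):
    """MH D-DIF from per-band counts.

    Each stratum is (ref_correct, ref_wrong, focus_correct, focus_wrong).
    """
    num = 0.0
    den = 0.0
    for ref_c, ref_w, foc_c, foc_w in strata:
        n = ref_c + ref_w + foc_c + foc_w
        if n == 0:
            continue  # skip empty score bands
        num += ref_c * foc_w / n
        den += ref_w * foc_c / n
    if num == 0 or den == 0:
        raise ValueError("common odds ratio is undefined for these counts")
    alpha_mh = num / den
    return -2.35 * math.log(alpha_mh)

def category_by_size(d_dif):
    """Size-only classification; significance testing is deliberately left out."""
    size = abs(d_dif)
    if size < 1.0:
        return "A"
    if size < 1.5:
        return "B"
    return "C"

# Hypothetical counts for five 20% score bands.
no_dif = [(20, 80, 10, 40), (50, 50, 25, 25)]
strong_dif = [(60, 40, 40, 60)] * 5

d_none = mh_d_dif(no_dif)
d_strong = mh_d_dif(strong_dif)
print(category_by_size(d_none), category_by_size(d_strong))
```

A negative value here means the item was easier for the reference group than for matched members of the focus group.
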

SIBTEST Method. In the SIBTEST method, items are allocated to two sub-tests: the
studied subtest, comprising items with potential DIF; and the matching subtest,
comprising items not having DIF. For each matching-subtest score, linear regression is
used to estimate the subtest true scores that are compared across the "k" score levels
for the focus and reference groups. Estimated true scores are adjusted using a
regression correction technique (Abbott, 2007; Gierl, Khaliq & Boughton, 1999). The
following formula gives the weighted average difference between the focus and
reference groups for the studied item or item cluster across the k subgroups
(Abbott, 2007):

$$\hat{\beta}_{UNI} = \sum_{k=0}^{K} p_k d_k$$

Here, $p_k$ is the proportion of the focus group in subgroup k, and $d_k$ is the
difference in adjusted means on the studied item or item cluster between the reference
and focus groups in subgroup k. If the value of $\hat{\beta}_{UNI}$ is positive, DIF
favors the reference group; if negative, DIF favors the focus
group (Abbott, 2007; Stout, Bolt, Froelich, Habing, Hartz & Roussos, 2003; Zhou,
Gierl & Tan, 2005). The value of $\hat{\beta}_{UNI}$ obtained for an item in SIBTEST
analysis was classified as follows according to the presence of DIF (Abbott, 2007;
Gierl et al., 1999; Gotzmann, Wright & Rodden, 2006): when there is no DIF, the null
hypothesis cannot be rejected and $|\hat{\beta}_{UNI}|$ is close to zero. When DIF is
negligible, or at level A, $|\hat{\beta}_{UNI}| < 0.059$ and $H_0: \beta = 0$ is
rejected. When DIF is at medium level, or at level B,
$0.059 \le |\hat{\beta}_{UNI}| < 0.088$ and $H_0: \beta = 0$ is rejected. When DIF is
at a significant level, or at level C, $|\hat{\beta}_{UNI}| \ge 0.088$ and
$H_0: \beta = 0$ is rejected.
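
The weighted sum and the cut-offs above can be sketched in a few lines. The proportions and adjusted differences below are hypothetical, the function names are illustrative, and the hypothesis test that accompanies each level is not implemented; only the magnitude rule is shown.

```python
def beta_uni(focus_proportions, adjusted_differences):
    """Weighted sum of adjusted mean differences across matching-score subgroups."""
    return sum(p * d for p, d in zip(focus_proportions, adjusted_differences))

def sibtest_level(beta):
    """Magnitude-only classification with the 0.059 and 0.088 cut-offs."""
    size = abs(beta)
    if size < 0.059:
        return "A"
    if size < 0.088:
        return "B"
    return "C"

# Hypothetical values for four subgroups (proportions sum to 1).
p_k = [0.2, 0.3, 0.3, 0.2]
d_k = [0.05, 0.10, 0.08, 0.04]

beta = beta_uni(p_k, d_k)
level = sibtest_level(beta)
print(round(beta, 3), level)
```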

IRT-LR procedures. The IRT-LR method uses a test of statistical significance to
compare two models: a compact model (C) and an augmented
model (A). The purpose of the method is to test whether the additional parameters in
the augmented model differ significantly from zero. The likelihood ratio formula is as
follows:

$$G^2(df) = 2 \log \left[ \frac{\text{Likelihood}(A)}{\text{Likelihood}(C)} \right]$$

Here, Likelihood[.] represents the likelihood of the model at the maximum likelihood
estimates of its parameters; df is the difference between the numbers of parameters
estimated in the compact model and the augmented model (Thissen, Steinberg and Wainer,
1993). In the likelihood ratio statistic for IRT-LR DIF analysis, the null hypothesis
states that there is no significant difference between the item parameters estimated
from the two groups. When all parameters estimated from the reference and focus
groups are equal, the value of $G^2$ is not expected to exceed 3.84 (df = 1,
α = 0.05 for the $\chi^2$ distribution). Thus, if the $G^2$ value exceeds 3.84, the
item is considered to show DIF (Thissen, 2001). The
IRTLRDIF v.2.0b program (Thissen, 2001) was used to determine whether items in
the PISA 2006 science literacy test of American and Turkish groups involved DIF
according to the IRT-LR method.
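
The decision rule can be made concrete with a small sketch. Because the formula is a log of a ratio, it is computed here as twice the difference of the two log-likelihoods. The log-likelihood values are hypothetical and the function name is illustrative; the 3.84 cut-off is the df = 1 critical value quoted in the text.

```python
CRITICAL_CHI2_DF1 = 3.84  # chi-square critical value for df = 1, alpha = 0.05

def g_squared(loglik_augmented, loglik_compact):
    """Likelihood ratio statistic: 2 * log(L(A) / L(C)) written with log-likelihoods."""
    return 2.0 * (loglik_augmented - loglik_compact)

# Hypothetical maximized log-likelihoods for one studied item.
loglik_compact = -10234.6    # item parameters constrained equal across groups
loglik_augmented = -10230.1  # item parameters free to differ between groups

g2 = g_squared(loglik_augmented, loglik_compact)
shows_dif = g2 > CRITICAL_CHI2_DF1
print(round(g2, 2), shows_dif)
```
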

Findings and Results

Equivalence of Test Structure. After the structure of the PISA 2006 science literacy
test in the Turkish and American samples was examined by EFA, it was tested by
CFA using the chi-square value and goodness-of-fit statistics for each group and
test booklet. These results are given in Table 1.


Table 1

Goodness of Fit Statistics for Turkish and American Samples and Test Booklets

Statistics   TUR         TUR         USA         USA         Range for good
             Booklet 1   Booklet 5   Booklet 1   Booklet 5   fit indices*
χ²           770.58      931.15      1139.54     907.43
df           945         1080        1484        1325        χ²/df ≤ 2
p            0.99        0.99        1.00        1.00        p > .05
RMSEA        0.00        0.000       0.00        0.000       RMSEA < 0.05
AGFI         0.91        0.90        0.91        0.92        AGFI > 0.90
GFI          0.92        0.91        0.91        0.92        GFI > 0.90
CFI          1.00        1.00        1.00        1.00        CFI > 0.90
RMR          0.074       0.076       0.066       0.062       RMR < 0.05
NFI          0.75        0.74        0.77        0.82        NFI > 0.90

*(Jöreskog and Sörbom, 1993; Kelloway, 1998)


As can be seen in Table 1, the χ² value of the model representing the unidimensionality
of booklets 1 and 5 was non-significant in the Turkish and American groups. For the
acceptability of a model, the χ² value is generally required to be non-significant
(Tabachnick & Fidell, 2007). Accordingly, the model was accepted for both groups, so
the unidimensional structure existed in both cases. In addition, the RMSEA, AGFI,
GFI, RMR and CFI values show that data in both groups are unidimensional.

MCFA was conducted to determine whether the factor structures of the tests differed
between the Turkish and American samples. This analysis used maximum likelihood (ML)
estimation with covariance matrices, since the sample was small and the data were
normally distributed. After calculating covariance matrices for each group
separately, MCFA was conducted. Three different MCFA models were applied to
the booklet 1 data. Model A was applied to test the equivalence of factor loadings,
inter-factor correlations and error variances. The results showed that the chi-square
significance level was not acceptable for this model, in which all three sets of
parameters were constrained. Model B was applied, assuming that the inter-factor
correlations and error variances were invariant while freeing the factor loadings, to
determine which component produced the difference between groups. Model B worked
better, since the difference between Models A and B was significant at the .05 level.
However, the model again gave poor fit values to the data. Model C was applied, in
which inter-factor correlations were held constant while error variances, in addition
to factor loadings, were allowed to differ between the groups. The significance test of
the difference between Model B and Model C at the .05 level showed that Model C
performed better. Also, considering the p value and goodness-of-fit values, the model
had acceptable goodness of fit, as shown in Tables 2 and 3.


Table 2

Results for Booklet 1 MCFA

Booklet 1                                                              χ²        df     p      RMSEA
Model A (factor values, inter-factors and error variances are equal)   4800.80   1560   0.00   0.071
Model B (equivalence of inter-factors with error variances)            4750.71   1530   0.00   0.072
Model C (invariance of inter-factor correlation)                       3200.76   1490   0.00   0.053

Table 3

Model Comparison for Booklet 1

Model Comparison     χ²         df
Model A - Model B    50.09      30
Model B - Model C    1549.95*   40

*p<.01


According to these results, the factor load values and error variances are different in 
both groups but factor structures in both groups are the same in terms of inter-factor 
correlations. 
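
The entries of Table 3 follow from Table 2 by subtraction, which the short sketch below reproduces. The chi-square and df values are taken from Table 2; the critical values (43.77 for df = 30 and 55.76 for df = 40 at the .05 level) come from standard chi-square tables rather than from the article, and the function name is illustrative.

```python
# (chi-square, df) for the three nested MCFA models, from Table 2.
models = {
    "A": (4800.80, 1560),
    "B": (4750.71, 1530),
    "C": (3200.76, 1490),
}

# Chi-square critical values at alpha = .05 (standard table values, an addition here).
CRITICAL_05 = {30: 43.77, 40: 55.76}

def chi_square_difference(restricted, relaxed):
    """Difference in chi-square and df between a restricted and a less restricted model."""
    chi_r, df_r = models[restricted]
    chi_f, df_f = models[relaxed]
    return chi_r - chi_f, df_r - df_f

diff_ab, df_ab = chi_square_difference("A", "B")
diff_bc, df_bc = chi_square_difference("B", "C")

significant_ab = diff_ab > CRITICAL_05[df_ab]
significant_bc = diff_bc > CRITICAL_05[df_bc]
print(round(diff_ab, 2), df_ab, significant_ab)
print(round(diff_bc, 2), df_bc, significant_bc)
```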

For booklet 5, the MCFA tested the equivalence of factor loadings, inter-factor
correlations and error variances in both groups. Table 4 shows that the chi-square
significance level and other fit values fit the data well. Consequently, all three models
showed that the factor structure of booklet 5 data was the same between the Turkish
and American samples. According to these results, it was concluded that there was
generally a unidimensional structure and that factor structures were equivalent
between cultures.


Table 4

Results for Multiple Group Confirmatory Factor Analysis

Statistics   TUR-USA     TUR-USA     Range for good fit indices*
             Booklet 1   Booklet 5
χ²           3200        1343.52
df           1490        1806        χ²/df ≤ 2
RMSEA        0.053       0.00        0.05 < RMSEA < 0.08
GFI          0.77        -           GFI > 0.90
CFI          -           1.00        CFI > 0.90
RMR          0.06        -           0.05 < RMR < 0.08
NFI          -           0.81        NFI > 0.90

*(Jöreskog and Sörbom, 1993; Kelloway, 1998)

Differential Item Functioning. The tables below show the results of the
analyses conducted with three methods in order to determine whether the items of the
PISA 2006 science literacy test show intercultural DIF between the USA and Turkish
groups. All DIF statistics were interpreted at a significance level of α = 0.05. The
items showing DIF at B and C levels were taken as DIF items, because DIF at levels B
and C indicates potential bias in the test more sensitively than level A (Gierl et al.,
1999; Gotzmann, 2002; Çet, 2006; Gotzmann et al., 2006).

DIF Analysis by Mantel-Haenszel Method. MH analyses are given in Table 5. 

Table 5

DIF Analysis by MH Method According to Turkish and American Groups

Booklet     Group favored                Items with DIF at level B                    Items with DIF at level C
Booklet 1   In favor of Turkish group    5, 13, 19, 33, 36, 38, 44, 45, 49, 53        6, 12, 41
Booklet 1   In favor of American group   4, 10, 15, 18, 28, 29, 39, 40, 42, 57, 58    17, 20, 37, 56
Booklet 5   In favor of Turkish group    2, 11, 14, 20, 25, 43, 54, 55, 58, 59        12, 23, 29, 33, 39, 48
Booklet 5   In favor of American group   6, 22, 35, 36, 40, 41, 49, 57                3, 5, 8, 16, 46, 47, 52, 60


Examining Table 5, it is seen that 21 of 58 items in booklet 1 show DIF at level B, 
i.e., at medium level, and 7 items show DIF at level C, i.e., at high level. Of the items 
showing DIF at B and C levels, 13 were found to be in favor of Turkish students 
while 15 items were in favor of American students. Table 5 shows that 18 of 60 items 
in booklet 5 show DIF at level B, while 14 items show DIF at level C. The results 
indicate that 16 items showing DIF at levels B and C were in favor of Turkish 
students, while 16 items were in favor of American students. 

DIF Analysis by SIBTEST Method. Table 6 shows results for items showing DIF as a 
result of SIBTEST analysis. 

Table 6

Results of DIF Analysis for Turkish and American Groups via SIBTEST Method

Booklet     Group favored                Items with DIF at level B     Items with DIF at level C
Booklet 1   In favor of Turkish group    11, 16, 23, 24, 44, 47, 56    5, 12, 13, 19, 33, 36, 38, 41, 49, 53, 57, 58
Booklet 1   In favor of American group   8, 10, 27                     4, 6, 15, 17, 18, 20, 28, 29, 37, 39, 45
Booklet 5   In favor of Turkish group    9, 15, 27, 53, 58             2, 11, 12, 14, 20, 23, 25, 29, 33, 39, 43, 48, 54, 55
Booklet 5   In favor of American group   1, 49                         3, 5, 6, 8, 16, 22, 35, 36, 38, 40, 41, 46, 47, 52, 57, 59, 60

Table 6 shows that 10 of 58 items in booklet 1 showed DIF at level B, while 23
items showed DIF at level C. Of the items showing DIF at levels B and C, 19 were in
favor of Turkish students while 14 items were in favor of American students. Of the
60 items in booklet 5, seven involve DIF at level B while 31 items involve DIF at level
C. Among the items showing DIF at levels B and C, 19 were in favor of Turkish
students while another 19 worked in favor of American students.

DIF Analysis by IRT-LR Method. Based on the results of the MH and SIBTEST
methods, three items in each booklet that did not show DIF were taken as
"anchor" items: items 1, 2 and 3 in booklet 1 and items 1, 4 and 7 in
booklet 5. The results of the IRT-LR analysis for items showing DIF are given in Table 7.


Table 7

Results of DIF Analysis by IRT-LR Method according to Turkish and American Groups

Booklet     Group favored                Items with DIF at level B                                         Items with DIF at level C
Booklet 1   In favor of Turkish group    6, 12, 41                                                         -
Booklet 1   In favor of American group   4, 10, 15, 17, 18, 21, 28, 29, 37, 39, 40, 42, 49, 56, 57         20
Booklet 5   In favor of Turkish group    2, 12, 14, 29, 33, 48, 57, 58                                     -
Booklet 5   In favor of American group   5, 6, 8, 16, 20, 23, 35, 36, 38, 39, 40, 41, 46, 47, 52, 59, 60   -


As seen in Table 7, 18 of 58 items in booklet 1 showed DIF at level B while 1 item
showed DIF at level C. Three items showing DIF at levels B and C were in favor of
Turkish students while 16 items were in favor of American students. Of the 60 items
in booklet 5, it was found that 25 showed DIF at level B and no item showed DIF at
level C. Eight items showing DIF at level B were in favor of Turkish students while
17 items were in favor of American students.

An item was accepted as showing DIF if it showed DIF at level B or C in each of the
three methods. Table 8 presents DIF items in booklet 1 according to group, and
distributions according to competencies evaluated by PISA 2006 and item formats.

Table 8

Distributions of Items in Booklet 1 Showing DIF for Turkish and USA Groups
according to the item content, measured skill by item and item format

Item #   Item code   Item content              Competency   Item format   Group favored
4        S213Q01T    Clothes                   ISI          CMC           TUR
6        S269Q01     Earth's Temperature       EPS          OR            USA
10       S326Q02     Milk                      USE          OR            USA
12       S326Q04T    Milk                      EPS          MC            TUR
15       S408Q04T    Wild Oat Grass            EPS          CMC           USA
17       S415Q02     Solar Panels              EPS          MC            USA
18       S415Q07T    Solar Panels              ISI          MC            USA
20       S416Q01     The Moon                  USE          OR            USA
28       S426Q05     Grand Canyon              EPS          MC            USA
29       S426Q07T    Grand Canyon              ISI          MC            USA
37       S485Q02     Acid Rain                 EPS          OR            USA
39       S485Q05     Acid Rain                 ISI          OR            USA
41       S493Q03T    Physical Exercise         EPS          MC            TUR
49       S510Q01T    Magnetic Hovertrain       EPS          MC            TUR
56       S527Q01T    Extinction of Dinosaurs   USE          MC            TUR
57       S527Q03T    Extinction of Dinosaurs   EPS          MC            USA

Note. Competencies: ISI = Identify scientific issues, EPS = Explain phenomena
scientifically, USE = Use scientific evidence. Item format: OR = Open-constructed
response, MC = Multiple-choice, CMC = Complex multiple-choice

Table 8 shows that 16 items in booklet 1 showed DIF, representing 27.6% of items
in the booklet. Five of the items showing DIF worked in favor of Turkish students
while 11 items worked in favor of American students. Table 9 shows DIF items in
booklet 5 according to group, and distributions according to competencies and item
formats.


Table 9

Distributions of Items in Booklet 5 Showing DIF for Turkish and USA Groups
according to the item content, measured skill by item and item format

Item #   Item code   Item content             Competency   Item format   Group favored
2        S131Q04T    Good Vibration           ISI          OR            TUR
5        S256Q01     Spoons                   EPS          MC            USA
6        S268Q01     Algae                    ISI          MC            USA
8        S268Q06     Algae                    EPS          MC            USA
12       S304Q03B    Water                    EPS          OR            TUR
14       S413Q05     Plastic Age              USE          MC            TUR
16       S416Q01     The Moon                 USE          OR            USA
20       S425Q03     Penguin Island           EPS          OR            TUR
23       S428Q01     Bacteria in Milk         USE          MC            TUR
29       S447Q02     Sunscreens               ISI          MC            TUR
33       S458Q01     The Ice Mummy            EPS          MC            TUR
35       S465Q01     Different Climates       USE          OR            USA
36       S465Q02     Different Climates       EPS          MC            USA
39       S466Q05     Forest Fires             USE          MC            TUR
40       S466Q07T    Forest Fires             ISI          MC            USA
41       S477Q02     Mary Montagu             EPS          MC            USA
46       S478Q03T    Antibiotics              EPS          MC            USA
47       S493Q01T    Physical Exercise        EPS          MC            USA
48       S493Q03T    Physical Exercise        EPS          MC            TUR
52       S498Q04     Experimental Digestion   USE          OR            USA
57       S519Q02T    Airbags                  EPS          CMC           USA
58       S519Q03     Airbags                  ISI          OR            TUR
59       S524Q06T    Penicillin Manufacture   USE          MC            USA
60       S524Q07     Penicillin Manufacture   USE          OR            USA

Note. Competencies: ISI = Identify scientific issues, EPS = Explain phenomena
scientifically, USE = Use scientific evidence. Item format: OR = Open-constructed
response, MC = Multiple-choice, CMC = Complex multiple-choice

Table 9 shows that 24 items in booklet 5 show DIF, representing 40% of items in
the booklet. Ten of the items worked in favor of Turkish students while 14 worked in
favor of American students. Since two of these items were common to both booklets,
it was concluded that 38 items showed DIF.
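
The agreement rule (an item counts only if all three methods flag it at level B or C) amounts to a set intersection. The sketch below applies it to the item numbers listed in Tables 5, 6 and 7, ignoring the direction of DIF; the variable names are illustrative, and the count of two shared items is taken from the text above rather than computed.

```python
# Items flagged at level B or C by each method (Tables 5, 6 and 7), direction ignored.
mh_b1 = {5, 13, 19, 33, 36, 38, 44, 45, 49, 53, 6, 12, 41,
         4, 10, 15, 18, 28, 29, 39, 40, 42, 57, 58, 17, 20, 37, 56}
sib_b1 = {11, 16, 23, 24, 44, 47, 56, 5, 12, 13, 19, 33, 36, 38, 41, 49, 53,
          57, 58, 8, 10, 27, 4, 6, 15, 17, 18, 20, 28, 29, 37, 39, 45}
irt_b1 = {6, 12, 41, 4, 10, 15, 17, 18, 21, 28, 29, 37, 39, 40, 42, 49, 56, 57, 20}

mh_b5 = {2, 11, 14, 20, 25, 43, 54, 55, 58, 59, 12, 23, 29, 33, 39, 48,
         6, 22, 35, 36, 40, 41, 49, 57, 3, 5, 8, 16, 46, 47, 52, 60}
sib_b5 = {9, 15, 27, 53, 58, 2, 11, 12, 14, 20, 23, 25, 29, 33, 39, 43, 48, 54,
          55, 1, 49, 3, 5, 6, 8, 16, 22, 35, 36, 38, 40, 41, 46, 47, 52, 57, 59, 60}
irt_b5 = {2, 12, 14, 29, 33, 48, 57, 58, 5, 6, 8, 16, 20, 23, 35, 36, 38, 39,
          40, 41, 46, 47, 52, 59, 60}

# An item is accepted as a DIF item only when all three methods agree.
dif_b1 = mh_b1 & sib_b1 & irt_b1
dif_b5 = mh_b5 & sib_b5 & irt_b5

COMMON_TO_BOTH_BOOKLETS = 2  # stated in the text; the same item appears in both booklets
total_dif_items = len(dif_b1) + len(dif_b5) - COMMON_TO_BOTH_BOOKLETS

share_b1 = round(100 * len(dif_b1) / 58, 1)  # booklet 1 has 58 items
share_b5 = round(100 * len(dif_b5) / 60, 1)  # booklet 5 has 60 items
print(sorted(dif_b1), share_b1)
print(sorted(dif_b5), share_b5)
print(total_dif_items)
```

The resulting item lists match the item numbers in Tables 8 and 9.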

Possible Source of DIF in Turkish and American Groups. One of the methods used to
determine the source of DIF involves expert opinion (Ercikan, 2002; Çet, 2006; Bakan
Kalaycıoğlu, 2008). A total of 38 items showed DIF, of which 9 had been released at
international level. The opinions of five science teachers and three assessment experts
were collected for these items; the results are shown in Table 10.


Table 10 

Distribution of Experts' Opinions about the Source of the Bias 

Possible Source of Bias                                         Number of judgments 

Cultural unfamiliarity with the content                          26 
The word or expression used for the item has 
different meaning in cultures                                     1 
The country groups become more familiar with 
the item format                                                   8 
The skills measured within the item are familiar 
to the relevant culture                                          16 
Other                                                             3 

Total number of judgments per item (nine items, in column order): 5, 9, 8, 7, 4, 7, 7, 1, 6 
Table 10 shows that some of the experts did not express an opinion about every 
item, whereas others gave two possible sources of bias for a single item. The 
most frequently cited sources of bias were "cultural unfamiliarity with the 
content" (26 judgments) and "the skills measured within the item are familiar to the 
relevant culture" (16 judgments). Another source of bias was "country groups 
becoming more familiar with the item format" (8 judgments). 
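
To put the judgment counts on a common scale, each can be expressed as a share of all judgments recorded in Table 10. The following is a minimal sketch using the row totals from the table; the shortened category labels and the function name are illustrative assumptions, not terms used by the study.

```python
# Row totals from Table 10 (number of expert judgments per source of bias).
judgments = {
    "Cultural unfamiliarity with the content": 26,
    "Word or expression has a different meaning across cultures": 1,
    "Familiarity with the item format": 8,
    "Skills measured are familiar to the relevant culture": 16,
    "Other": 3,
}


def judgment_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each source's share of all judgments, as a percentage."""
    total = sum(counts.values())
    return {source: 100.0 * n / total for source, n in counts.items()}


shares = judgment_shares(judgments)
# Print the sources from most to least frequently cited.
for source, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{pct:5.1f}%  {source}")
```

The row totals add up to the same figure as the per-item totals in the last row of the table, which is a quick consistency check on the tally.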

Tables 9 and 10 were also used to examine whether the competency evaluated in 
PISA 2006 affected DIF, by looking at the distribution of the DIF items across 
competencies. Of the 12 DIF items on using scientific evidence, 8 worked in favor of 
American students; of the 20 DIF items on explaining phenomena scientifically, 13 
worked in favor of American students. No difference was found between the Turkish 
and American student groups for items on identifying scientific issues. 

The effect of item format on DIF was examined by considering the distribution of 
item formats among the items showing DIF. According to Tables 9 and 10, across both 
booklets, 14 of the 24 multiple-choice items showing DIF worked in favor of American 
students while 10 worked in favor of Turkish students; 9 of the 13 open-ended items 
worked in favor of American students while 4 worked in favor of Turkish students. 
Accordingly, although there was no significant difference between the two groups for 
multiple-choice items, open-ended items gave an advantage to American students. 
Conclusions and Recommendations 

Exploratory factor analysis (EFA) of the structural equivalence of the PISA 2006 
science literacy tests in the Turkish and American groups detected a single-factor 
structure (science literacy) for both booklets. This single-factor structure was 
supported by CFA conducted on the data for booklets 1 and 5 completed by the Turkish 
and American groups. CFA has likewise been used in international studies to 
determine factor structures across groups and unidimensionality (Gierl, 2000; 
Ercikan & Kim, 2005; Çet, 2006; Yildirim & Berberoglu, 2009). Previous studies of 
PISA 2003 also found a single-factor structure for the Turkish and American groups 
(Çet, 2006; Yildirim, 2008; Yildirim & Berberoglu, 2009). 

MCFA showed that, in booklet 1, the factor loadings and error variances of the 
items differed between the Turkish and American groups, but the factor structures 
were the same for both groups in terms of inter-factor correlations. In booklet 5, 
the "factor values, inter-factor correlations and error variances" were equivalent 
across both groups. Consequently, it was decided that the factor structures of both 
booklets were equivalent in the Turkish and American applications of PISA 2006. 
This finding differs from that of a previous PISA 2003 study (Çet, 2006), which 
found a difference between the translated form and the original form (i.e., the 
measured structure was different) for the Turkish and American groups. 

The MH, SIBTEST and IRT-LR analyses showed that 16 items (28%) in booklet 1 and 
24 items (40%) in booklet 5 displayed DIF at levels B and C according to all three 
methods. Of these items, 15 worked in favor of Turkish students while 25 worked in 
favor of American students. However, since two of these items were common to both 
booklets, 38 items were found to show DIF in total. Previous studies of Turkish and 
American data for the PISA 2003 mathematics literacy test found different numbers of 
items with DIF (Çet, 2006; Yildirim, 2006; Yildirim, 2008; Yildirim & Berberoglu, 2009). 
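
For readers unfamiliar with the A, B and C levels, the sketch below shows how a Mantel-Haenszel DIF statistic is commonly computed and labeled. It assumes the widely used ETS delta scale with cut-offs at 1.0 and 1.5 and leaves out the accompanying significance tests; the counts are hypothetical, and nothing here reproduces the software or settings used in the study.

```python
import math


def mh_delta(strata):
    """Mantel-Haenszel DIF statistic on the ETS delta scale.

    Each stratum is a tuple (ref_correct, ref_wrong, foc_correct, foc_wrong)
    for examinees matched on total score.
    """
    num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        num += a * d / n
        den += b * c / n
    alpha = num / den               # common odds ratio across strata
    return -2.35 * math.log(alpha)  # negative values favor the reference group


def ets_level(delta):
    """Classify DIF by magnitude only (significance tests are omitted here)."""
    size = abs(delta)
    if size < 1.0:
        return "A"  # negligible
    if size < 1.5:
        return "B"  # moderate
    return "C"      # large


# Hypothetical counts for one item at three matched score levels.
example = [(30, 20, 20, 30), (40, 10, 30, 20), (45, 5, 40, 10)]
delta = mh_delta(example)
print(round(delta, 2), ets_level(delta))
```

Under this convention, an item counted as showing DIF in the study would correspond to a label of B or C.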

Expert opinions were sought on 9 items showing DIF in the present study. The 
expert responses suggested that bias originated in culture-related familiarity with 
the item content and with the skills measured by the item. Similarly, cultural 
difference has been reported as a source of bias in large-scale international 
studies (Gierl & Khaliq, 2001; Ercikan, 2002; Ercikan, Gierl, McCreith, Puhan & 
Koh, 2004). 

Among the processes evaluated in the PISA 2006 science literacy test, items on 
identifying scientific issues and explaining phenomena scientifically were found to 
be advantageous to the American group compared to the Turkish group, but there was 
no difference between the groups on items related to using scientific evidence. 
Comparing the two groups according to item format, about two-thirds of the 
open-ended items showing DIF were found to favor American students. For 
multiple-choice items, there was a small difference in favor of American students, 
but this difference was not significant. 

The study findings showed that some items in the PISA 2006 science literacy tests 
showed DIF in favor of Turkish students while others favored American students. 
The DIF observed was not large enough to affect average student performance. 
Nevertheless, in international evaluation studies of this kind, identifying sources 
of bias through descriptive analysis of item content will be beneficial for the 
participating countries where items are field-tested. 
Summary 

Problem Statement: Item bias occurs when individuals in one group have a lower 
probability of answering an item correctly than individuals in another group, even 
though they are at the same ability level. When an item is biased, the value of the 
trait measured by the test or item is systematically obtained as lower or higher 
than it actually is. Therefore, to ensure the accuracy of decisions based on test 
scores, items should be examined for possible bias during test development and test 
adaptation. Different methods are used to detect item bias under classical test 
theory (CTT) and item response theory (IRT). Within the CTT framework, many 
researchers investigate item bias by comparing classical item statistics, such as 
item means or item-test correlations, across groups. In the IRT literature, the 
concept of item bias is expressed as differential item functioning (DIF). 
Conducting item bias analyses with IRT means comparing the item parameter values 
estimated from the two groups and the areas between the item characteristic curves 
(ICCs) estimated for the item from the two groups. When the ICCs of a test item are 
not the same for the reference and focal groups, the item does not measure in the 
same way in both groups; in other words, it shows DIF. Studies generally use several 
of these methods together. In studies conducted to detect DIF in a test, the items 
flagged as showing DIF can differ depending on the method used. For this reason, 
rather than relying on a single method, using more than one method and examining the 
items flagged by more than one method gives more reliable results in identifying 
biased items. In this study, three different methods were used to investigate 
whether the items of the PISA 2006 science literacy test show bias. PISA is one of 
the studies that policy makers around the world take most into account when shaping 
educational policies. This study investigated whether the items of the PISA 2006 
science literacy test administered in Turkey show any indication of bias. The tests 
used in PISA are originally prepared in English and translated into the language of 
each participating country. Therefore, language is the most important cultural 
factor that may lead to item bias in such assessments. The study used data from the 
USA, one of the countries that took the PISA 2006 science literacy test in English, 
the original language in which the tests were prepared. 
Purpose of the Study: In addition to examining the structural equivalence of the 
PISA 2006 science literacy test in the Turkish and US student groups, this study 
aimed to determine whether the science literacy items show cross-cultural bias and, 
if so, to identify the possible causes of that bias. 

Methods: The study was carried out with data from 757 Turkish and 856 US students. 
To determine whether the science literacy test was structurally equivalent in the 
Turkish and US groups, exploratory and confirmatory factor analyses were first 
applied to the data. Multi-group confirmatory factor analysis (MCFA) was then 
conducted to determine whether the factor structure of the tests differed across 
cultures. Mantel-Haenszel (MH), the Simultaneous Item Bias Test (SIBTEST) and item 
response theory likelihood ratio analysis (IRT-LR) were used to determine whether 
the items of the PISA 2006 science literacy test showed bias. Items showing DIF at 
levels B and C in these analyses were treated as DIF items. 

Findings: According to the MCFA results, the factor structures of both booklets 
were judged to be equivalent in the Turkish and US versions of the PISA 2006 science 
literacy test. The analyses showed that the number of items showing DIF at levels B 
and C by all three methods was 16 (28%) in booklet 1 and 24 (40%) in booklet 5. Of 
these items, 15 worked in favor of Turkish students and 25 in favor of US students. 
According to the expert opinions obtained to identify the source of bias, 
culture-related familiarity with the item content and the familiarity of the 
relevant culture with the skills measured by the item were generally the sources of 
bias. Among the processes assessed in the PISA 2006 science literacy test, items on 
identifying scientific issues and explaining phenomena scientifically were found to 
give the US group an advantage over the Turkish group. No difference was found 
between the two country groups on items on using scientific evidence. Overall, 
based on the joint results of the three methods, it was not clearly established 
which group the DIF items favored in terms of item format and subject area. 

Conclusions and Recommendations: The study found that the number of items showing 
DIF differed across methods. The 40 items found to show DIF by all three methods 
were accepted as DIF items. According to the expert opinions on the released items 
among these, culture-related familiarity with the content of the item and with the 
skills it measures stood out among the possible causes of bias. It was concluded 
that the possible bias observed in the items was not at a level that would change 
the average student performance in the participating countries. Nevertheless, in 
international assessments of this kind, it would be useful to identify possible 
sources of bias through descriptive analyses of item content in the participating 
countries where items are field-tested. 

Keywords: PISA, differential item functioning, Mantel-Haenszel, IRT likelihood 
ratio analysis, SIBTEST 



